-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change Entity::generation from u32 to NonZeroU32 for niche optimization #9907
Change Entity::generation from u32 to NonZeroU32 for niche optimization #9907
Conversation
We might want to write a little migration script that increases the generation of every entity in a scene file by one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appears sensible, should probably merge after #9797 though
Do we have any prior art for this? Be good to look at before I give it a go |
Seems like this'll need a rework after that lands, which is fine - I'll keep my eyes out for it. I'll look at doing the benchmarking after that's in and I fix things up. |
If you want to do the usual wrapping for addition of two integers, here is how (playground) Proof that it works: adding two integers will result in exactly 0 or 1 overflows (2 × uN::MAX = (oveflow_flag, uN::MAX - 1)). The case with 0 overflows is trivial (just add them normally). If overflow happens, the result would be off-by-one with the normal wrapping. There are two extreme cases: if result was uN::MIN, adding 1 will make it NonZeroUN::MIN. If result was uN::MAX - 1, adding 1 will make it NonZeroUN::MAX. Thus, neither underflow nor overflow of NonZero domain is possible. There's also a commented out simple test for Miri, albeit it needs to be run locally. Tl;dr: it's as simple as (x + y + overflow_flag) |
@Kolsky Oh, that's really excellent :O Thanks for that! |
The optimized assembly is also really clean, nice! |
Still waiting on that other PR to merge, but I ran bench with 857fb9c: Cut down table excluding < 5% changes grepped for "entit":
|
I'm not too familiar with those numbers, but they seem rather inconsistent between different magnitudes of test: Shut everything down on the machine except the terminal running the test. |
Yeah, the benchmarks are definitely a bit noisy, both here and in general 🤔 I can't make clear sense of this unfortunately. |
Elabajaba on discord pointed out that since I'm on a Zen 3 CPU, I'm probably getting different boosts on every core. I'll try disabling that and running the tests again, see if I get something more stable and less noisey. |
Yep, that was it. I disabled Precision Boost and locked my CPU scaling to 2.2GHz on linux. Much more stable, not as much swing between tests, still a little noisey. But I think maybe i need to run a few more tests, because the results are little strange in places (30% gain on events_iter??) Ran against bb13d06 on main, and merged into the branch, threshold of 5%:
Seems like gains on |
@@ -147,7 +149,7 @@ mod tests { | |||
let mut world = World::new(); | |||
let mut mapper = EntityMapper::new(&mut map, &mut world); | |||
|
|||
let mapped_ent = Entity::new(FIRST_IDX, 0); | |||
let mapped_ent = Entity::new(FIRST_IDX, 1).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this file could just use from_raw
instead of the cfg(test)
-only function? And that'd save some unwrap
clutter too:
let mapped_ent = Entity::new(FIRST_IDX, 1).unwrap(); | |
let mapped_ent = Entity::from_raw(FIRST_IDX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point, I can also make this change in the lib.rs
tests in a few places.
@@ -156,7 +158,9 @@ mod tests { | |||
"should persist the allocated mapping from the previous line" | |||
); | |||
assert_eq!( | |||
mapper.get_or_reserve(Entity::new(SECOND_IDX, 0)).index(), | |||
mapper | |||
.get_or_reserve(Entity::new(SECOND_IDX, 1).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.get_or_reserve(Entity::new(SECOND_IDX, 1).unwrap()) | |
.get_or_reserve(Entity::from_raw(SECOND_IDX)) |
@@ -191,7 +196,7 @@ impl Entity { | |||
pub const fn from_raw(index: u32) -> Entity { | |||
Entity { | |||
index, | |||
generation: 0, | |||
generation: NonZeroU32::MIN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this function's rustdoc says "and a generation of 0", so that probably needs to be updated.
@@ -174,7 +178,7 @@ mod tests { | |||
let mut world = World::new(); | |||
|
|||
let dead_ref = EntityMapper::world_scope(&mut map, &mut world, |_, mapper| { | |||
mapper.get_or_reserve(Entity::new(0, 0)) | |||
mapper.get_or_reserve(Entity::new(0, 1).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mapper.get_or_reserve(Entity::new(0, 1).unwrap()) | |
mapper.get_or_reserve(Entity::from_raw(0)) |
(This is my first PR here, so I've probably missed some things. Please let me know what else I should do to help you as a reviewer!) # Objective Due to rust-lang/rust#117800, the `derive`'d `PartialEq::eq` on `Entity` isn't as good as it could be. Since that's used in hashtable lookup, let's improve it. ## Solution The derived `PartialEq::eq` short-circuits if the generation doesn't match. However, having a branch there is sub-optimal, especially on 64-bit systems like x64 that could just load the whole `Entity` in one load anyway. Due to complications around `poison` in LLVM and the exact details of what unsafe code is allowed to do with reference in Rust (rust-lang/unsafe-code-guidelines#346), LLVM isn't allowed to completely remove the short-circuiting. `&Entity` is marked `dereferencable(8)` so LLVM knows it's allowed to *load* all 8 bytes -- and does so -- but it has to assume that the `index` might be undef/poison if the `generation` doesn't match, and thus while it finds a way to do it without needing a branch, it has to do something slightly more complicated than optimal to combine the results. (LLVM is allowed to change non-short-circuiting code to use branches, but not the other way around.) Here's a link showing the codegen today: <https://rust.godbolt.org/z/9WzjxrY7c> ```rust #[no_mangle] pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool { a == b } ``` ends up generating the following assembly: ```asm demo_eq_ref: movq xmm0, qword ptr [rdi] movq xmm1, qword ptr [rsi] pcmpeqd xmm1, xmm0 pshufd xmm0, xmm1, 80 movmskpd eax, xmm0 cmp eax, 3 sete al ret ``` (It's usually not this bad in real uses after inlining and LTO, but it makes a strong demo.) This PR manually implements `PartialEq::eq` *without* short-circuiting, and because that tells LLVM that neither the generations nor the index can be poison, it doesn't need to be so careful and can generate the "just compare the two 64-bit values" code you'd have probably already expected: ```asm demo_eq_ref: mov rax, qword ptr [rsi] cmp qword ptr [rdi], rax sete al ret ``` Since this doesn't change the representation of `Entity`, if it's instead passed by *value*, then each `Entity` is two `u32` registers, and the old and the new code do exactly the same thing. (Other approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect this case.) This should hopefully merge easily with changes like #9907 that also want to change `Entity`. ## Benchmarks I'm not super-confident that I got my machine fully consistent for benchmarking, but whether I run the old or the new one first I get reasonably consistent results. Here's a fairly typical example of the benchmarks I added in this PR: ![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1) Building the sets seems to be basically the same. It's usually reported as noise, but sometimes I see a few percent slower or faster. But lookup hits in particular -- since a hit checks that the key is equal -- consistently shows around 10% improvement. `cargo run --example many_cubes --features bevy/trace_tracy --release -- --benchmark` showed as slightly faster with this change, though if I had to bet I'd probably say it's more noise than meaningful (but at least it's not worse either): ![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0) This is my first PR here -- and my first time running Tracy -- so please let me know what else I should run, or run things on your own more reliable machines to double-check. --- ## Changelog (probably not worth including) Changed: micro-optimized `Entity::eq` to help LLVM slightly. ## Migration Guide (I really hope nobody was using this on uninitialized entities where sufficiently tortured `unsafe` could could technically notice that this has changed.)
While I have no idea if that change is meaningful, note that in a microbenchmark dealing with iterators where the item type is define void @next1(ptr noalias nocapture noundef writeonly sret(%"core::option::Option<Entity1>") align 4 dereferenceable(12) %_0, ptr noalias nocapture noundef readonly align 4 dereferenceable(4) %it) unnamed_addr #0 { where %"core::option::Option<Entity1>" = type { i32, [2 x i32] } vs with the niche it's a simple pair that can be returned directly (without going through stack) define { i32, i32 } @next2(ptr noalias nocapture noundef readonly align 4 dereferenceable(4) %it) unnamed_addr #1 { Hopefully inlining and LTO and such would remove the effects of those differences, but sometimes zero-cost abstractions aren't (2-variant enums are sometimes worse than one would like -- see rust-lang/rust#85133 (comment) -- but once niched that stops happening.) |
let (lo, hi) = lhs.get().overflowing_add(rhs); | ||
let ret = lo + hi as u32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, clever! This codegens really elegantly; nice work 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kolsky up above came up with it:
#9907 (comment)
I was really surprised at it's elegance, definitely going to be using it in one or two places in my personal projects now that i know about it. I think, unfortunately, given my reading of:
#9797 (comment)
it means we won't be able to keep it (I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it we will be able to do something similar, but we'd need to wrap on a however many bits we have available for the generation segment. So at the moment, that would be 31 bits. So you'd have to do something like:
pub const fn nonzero_wrapping_high_increment(value: NonZeroU32) -> NonZeroU32 {
let next_value = value.get().wrapping_add(1);
// Mask the overflow bit
let overflowed = (next_value & 0x8000_0000) >> 31;
// Remove the overflow bit from the next value, but then add to it
unsafe { NonZeroU32::new_unchecked((next_value & 0x7FFF_FFFF) + overflowed) }
}
As long as we know we are only incrementing by one each time, then it should still output fairly terse asm: https://rust.godbolt.org/z/PnYTPfGb6 Basically the same principle as before, just applied to wrapping on 31 bits (or whatever amount of bits we need later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, that's interesting 🤔 We can definitely use that for the regular increment, it was just nice that this version worked for both slot incrementing and Entities::reserve_generations
(which requires an arbitrary increment, but honestly might not even be correct).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried a version with checked_add
, it has a similar instruction count, but is probably more expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I don't really have anything to add since we discussed most concerns in the linked Discord discussion. It's cool that we even avoided adding a branch to free
.
(This is my first PR here, so I've probably missed some things. Please let me know what else I should do to help you as a reviewer!) # Objective Due to rust-lang/rust#117800, the `derive`'d `PartialEq::eq` on `Entity` isn't as good as it could be. Since that's used in hashtable lookup, let's improve it. ## Solution The derived `PartialEq::eq` short-circuits if the generation doesn't match. However, having a branch there is sub-optimal, especially on 64-bit systems like x64 that could just load the whole `Entity` in one load anyway. Due to complications around `poison` in LLVM and the exact details of what unsafe code is allowed to do with reference in Rust (rust-lang/unsafe-code-guidelines#346), LLVM isn't allowed to completely remove the short-circuiting. `&Entity` is marked `dereferencable(8)` so LLVM knows it's allowed to *load* all 8 bytes -- and does so -- but it has to assume that the `index` might be undef/poison if the `generation` doesn't match, and thus while it finds a way to do it without needing a branch, it has to do something slightly more complicated than optimal to combine the results. (LLVM is allowed to change non-short-circuiting code to use branches, but not the other way around.) Here's a link showing the codegen today: <https://rust.godbolt.org/z/9WzjxrY7c> ```rust #[no_mangle] pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool { a == b } ``` ends up generating the following assembly: ```asm demo_eq_ref: movq xmm0, qword ptr [rdi] movq xmm1, qword ptr [rsi] pcmpeqd xmm1, xmm0 pshufd xmm0, xmm1, 80 movmskpd eax, xmm0 cmp eax, 3 sete al ret ``` (It's usually not this bad in real uses after inlining and LTO, but it makes a strong demo.) This PR manually implements `PartialEq::eq` *without* short-circuiting, and because that tells LLVM that neither the generations nor the index can be poison, it doesn't need to be so careful and can generate the "just compare the two 64-bit values" code you'd have probably already expected: ```asm demo_eq_ref: mov rax, qword ptr [rsi] cmp qword ptr [rdi], rax sete al ret ``` Since this doesn't change the representation of `Entity`, if it's instead passed by *value*, then each `Entity` is two `u32` registers, and the old and the new code do exactly the same thing. (Other approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect this case.) This should hopefully merge easily with changes like bevyengine#9907 that also want to change `Entity`. ## Benchmarks I'm not super-confident that I got my machine fully consistent for benchmarking, but whether I run the old or the new one first I get reasonably consistent results. Here's a fairly typical example of the benchmarks I added in this PR: ![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1) Building the sets seems to be basically the same. It's usually reported as noise, but sometimes I see a few percent slower or faster. But lookup hits in particular -- since a hit checks that the key is equal -- consistently shows around 10% improvement. `cargo run --example many_cubes --features bevy/trace_tracy --release -- --benchmark` showed as slightly faster with this change, though if I had to bet I'd probably say it's more noise than meaningful (but at least it's not worse either): ![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0) This is my first PR here -- and my first time running Tracy -- so please let me know what else I should run, or run things on your own more reliable machines to double-check. --- ## Changelog (probably not worth including) Changed: micro-optimized `Entity::eq` to help LLVM slightly. ## Migration Guide (I really hope nobody was using this on uninitialized entities where sufficiently tortured `unsafe` could could technically notice that this has changed.)
Since #9907 the generation starts at `1` instead of `0` so `Entity::to_bits` now returns `4294967296` (ie. `u32::MAX + 1`) as the lowest number instead of `0`. Without this change scene loading fails with this error message: `ERROR bevy_asset::server: Failed to load asset 'scenes/load_scene_example.scn.ron' with asset loader 'bevy_scene::scene_loader::SceneLoader': Could not parse RON: 8:6: Invalid generation bits`
Couldn't find information on this in the migration guide |
It looks like your PR is a breaking change, but you didn't provide a migration guide. Could you add some context on what users should update when this change get released in a new version of Bevy? |
I'm deeply sorry for the inconvenience I've caused with this. I'll look into some solutions. |
# Objective Adoption of #2104 and #11843. The `Option<usize>` wastes 3-7 bytes of memory per potential entry, and represents a scaling memory overhead as the ID space grows. The goal of this PR is to reduce memory usage without significantly impacting common use cases. Co-Authored By: @NathanSWard Co-Authored By: @tygyh ## Solution Replace `usize` in `SparseSet`'s sparse array with `nonmax::NonMaxUsize`. NonMaxUsize wraps a NonZeroUsize, and applies a bitwise NOT to the value when accessing it. This allows the compiler to niche the value and eliminate the extra padding used for the `Option` inside the sparse array, while moving the niche value from 0 to usize::MAX instead. Checking the [diff in x86 generated assembly](james7132/bevy_asm_tests@6e4da65), this change actually results in fewer instructions generated. One potential downside is that it seems to have moved a load before a branch, which means we may be incurring a cache miss even if the element is not there. Note: unlike #2104 and #11843, this PR only targets the metadata stores for the ECS and not the component storage itself. Due to #9907 targeting `Entity::generation` instead of `Entity::index`, `ComponentSparseSet` storing only up to `u32::MAX` elements would become a correctness issue. This will come with a cost when inserting items into the SparseSet, as now there is a potential for a panic. These cost are really only incurred when constructing a new Table, Archetype, or Resource that has never been seen before by the World. All operations that are fairly cold and not on any particular hotpath, even for command application. --- ## Changelog Changed: `SparseSet` now can only store up to `usize::MAX - 1` elements instead of `usize::MAX`. Changed: `SparseSet` now uses 33-50% less memory overhead per stored item.
# Objective Adoption of bevyengine#2104 and bevyengine#11843. The `Option<usize>` wastes 3-7 bytes of memory per potential entry, and represents a scaling memory overhead as the ID space grows. The goal of this PR is to reduce memory usage without significantly impacting common use cases. Co-Authored By: @NathanSWard Co-Authored By: @tygyh ## Solution Replace `usize` in `SparseSet`'s sparse array with `nonmax::NonMaxUsize`. NonMaxUsize wraps a NonZeroUsize, and applies a bitwise NOT to the value when accessing it. This allows the compiler to niche the value and eliminate the extra padding used for the `Option` inside the sparse array, while moving the niche value from 0 to usize::MAX instead. Checking the [diff in x86 generated assembly](james7132/bevy_asm_tests@6e4da65), this change actually results in fewer instructions generated. One potential downside is that it seems to have moved a load before a branch, which means we may be incurring a cache miss even if the element is not there. Note: unlike bevyengine#2104 and bevyengine#11843, this PR only targets the metadata stores for the ECS and not the component storage itself. Due to bevyengine#9907 targeting `Entity::generation` instead of `Entity::index`, `ComponentSparseSet` storing only up to `u32::MAX` elements would become a correctness issue. This will come with a cost when inserting items into the SparseSet, as now there is a potential for a panic. These cost are really only incurred when constructing a new Table, Archetype, or Resource that has never been seen before by the World. All operations that are fairly cold and not on any particular hotpath, even for command application. --- ## Changelog Changed: `SparseSet` now can only store up to `usize::MAX - 1` elements instead of `usize::MAX`. Changed: `SparseSet` now uses 33-50% less memory overhead per stored item.
Objective
Option<Entity>
to use memory layout optimization #3022Discussion
Option<Entity>
to leverage niche optimization #3029Solution
Entity::generation
from u32 to NonZeroU32 to allow for niche optimization.ChildOf
,Owes
), these will be able to be fast lookups, and the remainder of the range can use slower lookups to map to the address space.from_bits
type methods.EntityMeta
was changed to matchMigration
bevy_scene
serializations have a high likelihood of being broken, as they contain 0th generation entities.Current Issues
Entities::reserve_generations
andEntityMapper
wrap now, even in debug - although they technically did in release mode already so this probably isn't a huge issue. It just depends if we need to change anything here?